1 Data Acquisition

In this report, we will study the microarray expression of patients with stage III melanoma. More particularly, we will use the dataset to predict if a patient is still alive or not. First of all, we obtain the dataset GSE54467 from Gene Expression Omnibus which we can store as an Expression Set object using the Biobase package. The phenotype data contains some relevant information for each patient and can be visualized in the table below.

set.seed(2022) #set the seed to 2022 to obtain the same results every time the code is ran
library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)

# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)

gset <- getGEO("GSE54467", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")[[1]]


## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))
head(pheno)

However, the most important data for our analysis us the gene expression which we can extract from the gset object using the exprs() function. Looking at the boxplot representing the distribution of the first five genes we can see that the data has already been normalised.

## Get gene expression data
expVal <- exprs(gset)

boxplot(expVal[, 1:5])

## Get gene annotations
geneAnno <- fData(gset)

The Dataframe for the 26,085 genes and 79 pateints can be analysed below.

as.data.frame(expVal)

We add a new column called survClass which we set to “Good” if the patient is alive and to “Poor” if the patient died from melanoma. This column will be very helpful when fitting our logistic model later on. A count for each binary outcome is shown below.

pheno$survClass <- NA
pheno$survClass[pheno$survival/12 >= 4 & pheno$status == "Alive NSR"] <- "Good"
pheno$survClass[pheno$survival/12 <= 1 & pheno$status == "Dead Melanoma"] <- "Poor"

table(pheno$survClass)
## 
## Good Poor 
##   25   22

2 Differential Expression Analysis

Now that all the required pre-processing has been done, we are ready to start with our DE analysis. We used the limma package to create a design matrix which performes gene expression to compare patient from Good and Poor survival class. A linear model is fitted to create and an emperical Bayes moderation of the standard errors and thus get a better estimates of gne-wise variability.

design = model.matrix(~survClass, pheno)
fit = lmFit(gset[, rownames(design)], design, method = "robust")  
efit = eBayes(fit, trend = TRUE)

Finally, we used the efit object to determine the number of up and down regulated genes which can be visualized in the table below, as well as in the volcano plot which highlight the 10 most significant genes

## Top genes
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
##        (Intercept) survClassPoor
## Down             0           111
## NotSig           0         25950
## Up           26085            24
volcanoplot(efit, coef = "survClassPoor", highlight = 10, names = efit$genes$SYMBOL)

## Pull out top 10 genes
top10 <- topTable(efit, n = 10)
## Removing intercept from test coefficients
top10

The top 10 up and down regulated genes have also been saved in top10 which we will use to obtain our data to fit a logistic regression model.

3 Logistic Regression

Recall, that genes were previously stored as row. Ho ever, when performing logistic regression, we want to predict an outcome for each row given the value of different columns. Thus we use the function t() to get a transpose matrix where the expression of the top 10 genes are the columns and the 47 patients are the rows. An other column Y was appended to the dataframe as a factor which is our binary outcome variable.

X <- t(expVal[rownames(top10), rownames(design)])
Y <- as.factor(pheno[rownames(design), "survClass"])

data <- data.frame(Y, X)
data

Using the glm function we fit a full logistic regression model. Here R is trying to predict the outcome Y for each patient using all 10 genes. Lokking at the P-value below, we can see that some genes are not significant.

lreg = glm(Y ~., data, family = binomial)
lreg %>% summary()
## 
## Call:
## glm(formula = Y ~ ., family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5271  -0.6472  -0.3043   0.6394   1.8662  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  18.46745   29.98511   0.616   0.5380  
## ILMN_2157441 -0.18357    0.62976  -0.291   0.7707  
## ILMN_1698367 -0.02228    1.30426  -0.017   0.9864  
## ILMN_1655867  0.63261    0.83134   0.761   0.4467  
## ILMN_1734855 -2.04628    4.56069  -0.449   0.6537  
## ILMN_1768814  2.62271    1.75727   1.492   0.1356  
## ILMN_1667232 -0.39624    1.04975  -0.377   0.7058  
## ILMN_1813379 -1.21282    1.77497  -0.683   0.4944  
## ILMN_2049184 -0.40712    0.68892  -0.591   0.5546  
## ILMN_1815205 -0.99546    0.53890  -1.847   0.0647 .
## ILMN_2359800  0.04893    0.73773   0.066   0.9471  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.964  on 46  degrees of freedom
## Residual deviance: 41.193  on 36  degrees of freedom
## AIC: 63.193
## 
## Number of Fisher Scoring iterations: 5

We used the MASS package to fit a backward-model to predict the Y outcome. We can see that only the genes ILMN_1815205 was kept and the AIC is now lower indicating that the backward-model performs better than the full model.

library(MASS)
step.model <- lreg %>% stepAIC(trace = FALSE)
step.model %>% summary()
## 
## Call:
## glm(formula = Y ~ ILMN_1815205, family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7137  -0.7401  -0.3288   0.7789   2.6736  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    9.1659     2.7371   3.349 0.000812 ***
## ILMN_1815205  -0.8833     0.2633  -3.355 0.000794 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.964  on 46  degrees of freedom
## Residual deviance: 46.929  on 45  degrees of freedom
## AIC: 50.929
## 
## Number of Fisher Scoring iterations: 4

The equation of the model can be visualised below using the equatiomatic package:

equatiomatic::extract_eq(step.model, use_coefs = TRUE)

\[ \log\left[ \frac { \widehat{P( \operatorname{Y} = \operatorname{Poor} )} }{ 1 - \widehat{P( \operatorname{Y} = \operatorname{Poor} )} } \right] = 9.17 - 0.88(\operatorname{ILMN\_1815205}) \]

The log-odds of belonging to a poor survival class is negatively correlated with the expression of the gene ILMN_1815205.

4 5-fold Cross Validation

4.1 Full-model

Using the caret package, we can perform a 5-fold cross validation on the full model.

library(caret)

train_control = trainControl(method = "cv", number = 5)

model = train(Y ~., data = data, trControl = train_control, method = "glm", family = binomial())

model
## Generalized Linear Model 
## 
## 47 samples
## 10 predictors
##  2 classes: 'Good', 'Poor' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 37, 38, 38, 37, 38 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6844444  0.3615385

4.2 Step-wise model

Similarly, we can perform a 5-cross validation on the step-wise model.

train_control2 = trainControl(method = "cv", number = 5)

model2 = train(Y ~., data = data, trControl = train_control2, method = "glmStepAIC", family = binomial(), trace= FALSE)

model2
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 47 samples
## 10 predictors
##  2 classes: 'Good', 'Poor' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 38, 38, 38, 37, 37 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7266667  0.4496534

As expected, the Stepwise feature selection model is performing better than the full model.

5 Conclusion

Consequently, we built a logistic regression model which obtains a score of ~0.7 with a 5-fold cross validation when predicting if a stage III melanoma patient is going to survive or not. This model only require the expression of one gene and thus it might be used as a predictor for survival as long as the procedure to obtain the gene expresion is not to invassive.